ABSTRACT

It is shown that the cepstrum is symmetric about a
quefrency, termed the folding quefrency. Various
allowable combinations of number of digitized samples
and sampling times are given, which yield valid cepstra
up to 20 msec.
1. INTRODUCTION

In 1966, Noll [1] presented a method for determining the pitch period of
voiced speech based upon the cepstrum. Briefly, the cepstrum is the power
spectrum of the logarithm of the power spectrum and has a strong peak at the
pitch period of the voiced-speech segment being analyzed. It is shown here
that just as the spectrum has a folding frequency about which the spectrum is
symmetric, the cepstrum exhibits a similar behavior, and care must be exercised
in selecting the proper number of sample points and sampling period for the
calculation of valid cepstra.


2.  THE  CEPSTRUM 


A  simplified  model  for  the  production  of  voiced-speech  sounds  is  shown  in 
Figure  1. 


[Block diagram: Vocal Source -> Vocal Tract -> Voiced Speech]


Figure  1.  Model  for  the  Production  of 
Voiced  Speech 

The vocal source produces a signal g(t) consisting of periodic puffs of air
that travel through the vocal cords. If h(t) is the impulse response of the
vocal tract, then voiced speech, represented by f(t), is the convolution
of g and h:

$$f(t) = (g * h)(t)$$

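Let F(ω), G(ω), and H(ω) denote the Fourier transforms of f(t), g(t), and h(t),
respectively:

$$F(\omega) = \mathcal{F}\{f(t)\} = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt \qquad (i = \sqrt{-1}),$$

$$G(\omega) = \mathcal{F}\{g(t)\} \quad \text{and} \quad H(\omega) = \mathcal{F}\{h(t)\}.$$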

Then, using the fact that the Fourier transform of a convolution is the product
of the Fourier transforms, we have that

$$F(\omega) = G(\omega) \cdot H(\omega).$$
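
A discrete analogue of this identity can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: it uses a naive discrete Fourier transform and circular convolution (the discrete counterpart of the convolution above), and the sequences g and h are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT: X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def circular_convolve(g, h):
    """Circular convolution of two equal-length sequences."""
    n_pts = len(g)
    return [
        sum(g[m] * h[(n - m) % n_pts] for m in range(n_pts))
        for n in range(n_pts)
    ]

# Hypothetical sequences, chosen only for illustration.
g = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]      # pulses every 4 samples
h = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]   # decaying impulse response

f = circular_convolve(g, h)
F, G, H = dft(f), dft(g), dft(h)

# Largest deviation between F_k and G_k * H_k over all k.
max_error = max(abs(F[k] - G[k] * H[k]) for k in range(len(f)))
print(max_error)
```

The printed deviation should be at the level of floating-point rounding, which is what the product rule predicts.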

For voiced-speech sounds, g(t) is quasiperiodic; thus, f(t) is also quasiperiodic.
Hence, if the period is P seconds, then the spectrum |F(ω)| of the
speech signal consists of harmonics spaced 1/P Hz apart. This is illustrated in
Figure 2, which is the spectrum of

$$f(t) = \sin 2\pi\omega t + 0.5 \sin 4\pi\omega t + \sin 6\pi\omega t + 0.5 \sin 10\pi\omega t,$$

where ω = 500 Hz. The "periodicity", occurring at 500 Hz intervals, is
shown in the figure. Note that

$$500\ \text{Hz} = 500\ \text{cps} = \frac{1\ \text{cycle}}{2\ \text{msec}},$$

so that P = 2 msec, as expected.


This periodicity in the spectrum can be obtained by taking the Fourier
transform of the spectrum. We would then expect that the transformed spectrum
would have a peak corresponding to the period. The transformed spectrum is

$$a(\tau) = \mathcal{F}\{|F(\omega)|^2\},$$

which is the autocorrelation function of f(t) (see [2], pp. 241-242). Using
the fact that

$$F(\omega) = G(\omega) \cdot H(\omega)$$

gives us

$$\begin{aligned}
a(\tau) &= \mathcal{F}\{|G(\omega)|^2 \cdot |H(\omega)|^2\} \\
        &= \mathcal{F}\{|G(\omega)|^2\} * \mathcal{F}\{|H(\omega)|^2\} \\
        &= a_g(\tau) * a_h(\tau),
\end{aligned}$$

where * again denotes convolution and a_g(τ) and a_h(τ) are the autocorrelation
functions of g(t) and h(t), respectively. Since the effects of g(t) and h(t) are
convolved in the autocorrelation function a(τ), it is difficult to obtain a clear
peak in a(τ) resulting from the periodicity of g(t), as desired. This problem
is circumvented by taking the logarithm of the spectrum:
$$\log|F(\omega)|^2 = \log\left[|G(\omega)|^2 \cdot |H(\omega)|^2\right] = \log|G(\omega)|^2 + \log|H(\omega)|^2$$

and then taking the Fourier transform of the result:

$$\mathcal{F}\{\log|F(\omega)|^2\} = \mathcal{F}\{\log|G(\omega)|^2\} + \mathcal{F}\{\log|H(\omega)|^2\}.$$

The effects of g(t) and h(t) are now combined additively, rather than being
convolved as in a(τ). In addition, the function c defined by

$$c(\tau) = \mathcal{F}\{\log|F(\omega)|^2\}$$

has a strong peak corresponding to the effects of g(t) and a broader peak
corresponding to h(t), since the effect of g(t) is to produce high-frequency
ripples in the spectrum, while h(t) produces the low-frequency formant
structure. If c(τ) is squared,

$$[c(\tau)]^2 = \left[\mathcal{F}\{\log|F(\omega)|^2\}\right]^2,$$

then the peak in c(τ) is made even more pronounced, and it is this last
expression which is referred to as the "cepstrum". The variable τ in the
cepstrum has units of cycles per hertz, or seconds. Values of τ are commonly
referred to as "quefrencies", and we shall use this terminology throughout.
The cepstrum corresponding to the spectrum in Figure 2 is shown in Figure 3.
Note the strong peak corresponding to the pitch period of 2 msec.


An alternate definition of the cepstrum is obtained by using the inverse
Fourier transform after taking the logarithm:

$$c(\tau) = \mathcal{F}^{-1}\{\log|F(\omega)|^2\}.$$

This definition is more computationally stable than the others in its discrete
form, and it is the one we adopt.


3.  THE  CEPSTRAL  FOLDING  QUEFRENCY 


For digital computation of the cepstrum, we need the notion of the discrete
Fourier transform (DFT): if x_0, x_1, ..., x_{N-1} is a discrete time series, then
the DFT of x_0, x_1, ..., x_{N-1} is

$$X_k = \sum_{n=0}^{N-1} x_n \exp(-2\pi i n k / N)$$

for k = 0, 1, ..., N-1, where i = √-1. The inverse discrete Fourier transform
(IDFT) of X_0, X_1, ..., X_{N-1} is

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \exp(2\pi i n k / N)$$

for n = 0, 1, ..., N-1. Thus,

$$x_n = \mathrm{IDFT}\{\mathrm{DFT}(x_n)\} \qquad (n = 0, 1, \ldots, N-1).$$
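
The round-trip property can be illustrated directly from the two sums. The following is a minimal sketch using a naive O(N^2) evaluation rather than a fast transform; the input sequence is hypothetical, and the 1/N factor is placed on the inverse transform.

```python
import cmath

def dft(x):
    """Forward transform: X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def idft(X):
    """Inverse transform: x_n = (1/N) * sum_k X_k * exp(2*pi*i*n*k/N)."""
    n_pts = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]

# Hypothetical time series, chosen only for illustration.
x = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75]

recovered = idft(dft(x))

# Largest difference between the original and recovered samples.
round_trip_error = max(abs(a - b) for a, b in zip(x, recovered))
print(round_trip_error)
```

The recovered values are complex numbers whose imaginary parts are rounding noise; the error printed is the largest magnitude of the difference.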

Now suppose that a continuous voiced-speech signal is sampled every T seconds;
e.g., for a sampling rate of 10,000 samples/sec., T = .0001. Assume also that
N samples are obtained:

$$x_n = f(nT) \qquad (n = 0, 1, \ldots, N-1).$$

Note that the Nyquist folding frequency is 1/(2T) Hz.

The spectrum of x_0, x_1, ..., x_{N-1} is

$$S_k = 20 \log_{10} |X_k| \qquad (k = 0, 1, \ldots, N-1),$$

where

$$X_k = \sum_{n=0}^{N-1} x_n \exp(-2\pi i n k / N)$$

and the constants are chosen to make the relative energy level in dB. The
discrete cepstrum is now defined to be

$$\begin{aligned}
c(nT) &= \mathrm{Re}\{\mathrm{IDFT}(S_k)\} \\
      &= \mathrm{Re}\left\{\frac{1}{N} \sum_{k=0}^{N-1} S_k \exp(2\pi i n k / N)\right\} \\
      &= \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos(2\pi n k / N)
\end{aligned}$$

for n = 0, 1, ..., N-1. Thus, c(nT) is defined at the N discrete quefrencies
0, T, 2T, ..., (N-1)T. By the folding quefrency we mean the quefrency about
which c(nT) is symmetric.
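
The definition of c(nT) can be sketched in a few lines. Everything about the input here is an assumption for illustration: a hypothetical decaying pulse train with a period of 20 samples stands in for the vocal source, a short decaying sequence stands in for the vocal-tract impulse response, and T is taken as .0001 sec. Both sequences are chosen so that no |X_k| is zero and the logarithm stays finite.

```python
import cmath
import math

NUM_SAMPLES = 128       # N (assumed)
PERIOD_SAMPLES = 20     # assumed pitch period in samples
SAMPLE_PERIOD = 1e-4    # T in seconds (assumed)

def dft(x):
    """Naive DFT: X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def discrete_cepstrum(x):
    """c(nT) = (1/N) * sum_k S_k * cos(2*pi*n*k/N), with S_k = 20*log10|X_k|."""
    n_pts = len(x)
    spectrum_db = [20.0 * math.log10(abs(X_k)) for X_k in dft(x)]
    return [
        sum(spectrum_db[k] * math.cos(2 * math.pi * n * k / n_pts)
            for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]

# Hypothetical source: six pulses, 20 samples apart, each 0.8 times the last.
source = [0.0] * NUM_SAMPLES
for m in range(6):
    source[m * PERIOD_SAMPLES] = 0.8 ** m

# Hypothetical vocal-tract impulse response: eight decaying samples.
tract = [0.5 ** j for j in range(8)]

# Voiced speech as the convolution of source and tract (fits within N samples).
speech = [0.0] * NUM_SAMPLES
for n, g_n in enumerate(source):
    if g_n != 0.0:
        for j, h_j in enumerate(tract):
            speech[n + j] += g_n * h_j

cepstrum = discrete_cepstrum(speech)

# Search above the low quefrencies (vocal-tract region) and up to N/2 samples.
peak_index = max(range(10, NUM_SAMPLES // 2 + 1), key=lambda n: cepstrum[n])
peak_msec = peak_index * SAMPLE_PERIOD * 1000.0
print(peak_index, peak_msec)
```

The lower search limit of 10 samples is an arbitrary choice that skips the low-quefrency region where the impulse response contributes most.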
Theorem. The folding quefrency of the cepstrum c(nT) is NT/2.

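Proof. By definition,

$$c(nT) = \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos(2\pi n k / N).$$

Thus,

$$\begin{aligned}
c\left(\tfrac{NT}{2} + mT\right)
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos\left[2\pi\left(\tfrac{N}{2} + m\right) k / N\right] \\
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos\left[\pi k + 2\pi m k / N\right] \\
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k (-1)^k \cos(2\pi m k / N),
\end{aligned}$$

since cos(πk ± θ) = (-1)^k cos θ for integer k. Also,

$$\begin{aligned}
c\left(\tfrac{NT}{2} - mT\right)
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos\left[2\pi\left(\tfrac{N}{2} - m\right) k / N\right] \\
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k \cos\left[\pi k - 2\pi m k / N\right] \\
  &= \frac{1}{N} \sum_{k=0}^{N-1} S_k (-1)^k \cos(2\pi m k / N) \\
  &= c\left(\tfrac{NT}{2} + mT\right),
\end{aligned}$$

and the result is proved.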
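
The symmetry can also be checked numerically. The sketch below assumes an even N and uses arbitrary pseudo-random values in place of a measured spectrum S_k, since the argument above does not depend on the particular values of S_k.

```python
import math
import random

def cosine_cepstrum(spectrum_db):
    """c(nT) = (1/N) * sum_k S_k * cos(2*pi*n*k/N) for n = 0, ..., N-1."""
    n_pts = len(spectrum_db)
    return [
        sum(spectrum_db[k] * math.cos(2 * math.pi * n * k / n_pts)
            for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]

random.seed(0)
NUM_SAMPLES = 64  # N, assumed even so that N/2 is an integer index

# Arbitrary stand-in for a spectrum in dB; not measured data.
spectrum_db = [random.uniform(-60.0, 0.0) for _ in range(NUM_SAMPLES)]

cepstrum = cosine_cepstrum(spectrum_db)
half = NUM_SAMPLES // 2

# Compare c(NT/2 + mT) with c(NT/2 - mT) for every m that stays in range.
max_asymmetry = max(abs(cepstrum[half + m] - cepstrum[half - m])
                    for m in range(1, half))
print(max_asymmetry)
```

The printed value should be at the level of floating-point rounding, so only the quefrencies from 0 to NT/2 carry independent information.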


In order to obtain usable cepstra, N and T must be selected so that the time
interval from τ = 0 to τ = NT/2 is sufficiently large to encompass the pitch
periods of most speakers. An interval from 0 to 20 msec is usually adequate,
and Table 1 gives some appropriate combinations of N and T which meet this
specification:

Table 1. Combinations of N & T for valid cepstra

N        Maximum number of
         samples/sec

128      3,200
256      6,400
512      12,800
1024     25,600
2048     51,200
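
The entries in Table 1 follow from requiring NT/2 to be at least 20 msec, which gives T >= 0.040/N and hence a sampling rate of at most N/0.040 samples/sec. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
MAX_QUEFRENCY_SEC = 0.020  # cepstrum must be valid out to 20 msec

def max_sampling_rate(num_samples, max_quefrency_sec=MAX_QUEFRENCY_SEC):
    """Largest rate 1/T (samples/sec) for which N*T/2 >= max_quefrency_sec."""
    return num_samples / (2.0 * max_quefrency_sec)

# Reproduce the table for the listed values of N.
table = {n: round(max_sampling_rate(n)) for n in (128, 256, 512, 1024, 2048)}

for n, rate in table.items():
    print(n, rate)
```

For example, sampling at 10,000 samples/sec calls for N = 512 or more, since 256 samples would reach only 12.8 msec.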